Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve inferrability of getproperty(::Row2, ::Symbol) #753

Merged
merged 1 commit into from
Oct 16, 2020
Merged

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Oct 16, 2020

Fixes #752. This is a case of us not providing just not quite enough
information to the compiler, along with the compiler itself being too
clever. The default for CSV.Rows is to treat each column as
Union{String, Missing}, which results in the V type parameter of
CSV.Rows being CSV.PosLen, instead of Any. If that's the case, we
should get pretty good inferrability for getproperty(::Row2, ::Symbol), because we should be able to know the return value will at
least be Union{String, Missing}. This knowledge, however, was trapped
in the "csv domain" and not expressed clearly enough to the compiler. It
inspected Tables.getcolumn(::Row2, nm::Symbol) and saw that it called
Tables.getcolumn(::Row2, i::Int), which in turn called
Tables.getcolumn(::Row2, T, i, nm). This is all fine an expected,
except that when we started supporting non-String types for CSV.Rows
(i.e. you can pass in whatever type you want and we'll parse it directly
from the file for each row), we added an additional typed
Tables.getcolumn method that handled all the non-String columns. Oops.
Now the compiler is confused because from Tables.getcolumn(::Row2, nm::Symbol) it knows that it can return missing, a String, or if we
call this third method, it'll return an instance of our V type
parameter, which, if you'll remember, in the default case is
CSV.PosLen, or more simply, UInt64. So we ended up with a return
type of Union{Missing, UInt64, String}, which makes downstream
operations even trickier to figure out.

Luckily, the solution here is to just help connect the dots for the
compiler: i.e. define specialize methods that dispatch on V,
specifically when V === UInt64. Then the compiler will see/know that
we will only ever call the Union{String, Missing} method and can
ignore the custom types codepath. This PR also rearranges a few
@inbounds uses since we can avoid the bounds checks further down the
stack once we've checked them higher up.

Fixes #752. This is a case of us not providing just not quite enough
information to the compiler, along with the compiler itself being too
clever. The default for `CSV.Rows` is to treat each column as
`Union{String, Missing}`, which results in the `V` type parameter of
`CSV.Rows` being `CSV.PosLen`, instead of `Any`. If that's the case, we
should get pretty good inferrability for `getproperty(::Row2,
::Symbol)`, because we should be able to know the return value will at
least be `Union{String, Missing}`. This knowledge, however, was trapped
in the "csv domain" and not expressed clearly enough to the compiler. It
inspected `Tables.getcolumn(::Row2, nm::Symbol)` and saw that it called
`Tables.getcolumn(::Row2, i::Int)`, which in turn called
`Tables.getcolumn(::Row2, T, i, nm)`. This is all fine an expected,
except that when we started supporting non-String types for `CSV.Rows`
(i.e. you can pass in whatever type you want and we'll parse it directly
from the file for each row), we added an additional typed
`Tables.getcolumn` method that handled all the non-String columns. Oops.
Now the compiler is confused because from `Tables.getcolumn(::Row2,
nm::Symbol)` it knows that it can return `missing`, a `String`, or if we
call this third method, it'll return an instance of our `V` type
parameter, which, if you'll remember, in the default case is
`CSV.PosLen`, or more simply, `UInt64`. So we ended up with a return
type of `Union{Missing, UInt64, String}`, which makes downstream
operations even trickier to figure out.

Luckily, the solution here is to just help connect the dots for the
compiler: i.e. define specialize methods that dispatch on `V`,
specifically when `V === UInt64`. Then the compiler will see/know that
we will only ever call the `Union{String, Missing}` method and can
ignore the custom types codepath. This PR also rearranges a few
`@inbounds` uses since we can avoid the bounds checks further down the
stack once we've checked them higher up.
@codecov
Copy link

codecov bot commented Oct 16, 2020

Codecov Report

Merging #753 into master will decrease coverage by 0.09%.
The diff coverage is 77.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #753      +/-   ##
==========================================
- Coverage   85.83%   85.73%   -0.10%     
==========================================
  Files          10       10              
  Lines        1948     1956       +8     
==========================================
+ Hits         1672     1677       +5     
- Misses        276      279       +3     
Impacted Files Coverage Δ
src/rows.jl 92.59% <77.77%> (-1.06%) ⬇️
src/precompile.jl 96.42% <0.00%> (-3.58%) ⬇️
src/write.jl 85.76% <0.00%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f6ef10...b545cf8. Read the comment docs.

@quinnj quinnj merged commit 74919d0 into master Oct 16, 2020
@quinnj quinnj deleted the jq/752 branch October 16, 2020 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regressions CSV.Rows since 0.5?
1 participant